epiAneufinder is an algorithm used for calling Copy Number Variations (CNVs) from single-cell ATAC (scATAC) data. Single-cell open chromatin profiling via the single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) assay has become a mainstream measurement of open chromatin in single-cells. epiAneufinder exploits the read count information from scATAC-seq data to extract genome-wide copy number variations (CNVs) for each individual cell. It allows the addition of single-cell CNV information to scATAC-seq data, without the need of additional experiments, unlocking a layer of genomic variation which is otherwise unexplored.
The main function of epiAneufinder performs the following three steps:
It can be run as follows
epiAneufinder(input="../sample_data/sample.tsv",
outdir="epiAneufinder_results",
blacklist="../sample_data/hg38-blacklist.v2.bed",
windowSize=1e5,
genome="BSgenome.Hsapiens.UCSC.hg38",
exclude=c('chrX','chrY','chrM'),
reuse.existing=TRUE,
title_karyo="Karyogram of sample data",
ncores=4,
minFrags=20000,
minsizeCNV=0,
k=4,
plotKaryo=TRUE)
#> Warning in dir.create(outdir, recursive = TRUE):
#> 'epiAneufinder_results/epiAneufinder_results' already exists
#> Filtering cell without enough covered windows, 16 cells remain.
#> Filtering empty windows, 24558 windows remain.
#> Successfully identified breakpoints
#> A .tsv file with the results has been written to disk.
#> It contains the copy number states for each cell per bin.
#> 0 denotes 'Loss', 1 denotes 'Normal', 2 denotes 'Gain'.
#> Successfully plotted karyogramIn the example above, a fragment file is supplied, but different input formats are possible for EpiAneufinder. Alternatives are a directory with bam files per cell or an already generated count matrix. The file type is identified automatically. The fragment file needs to be a single file with the ending tsv or bed. For the count matrix, a directory with three files is required, called matrix.mtx, barcodes.tsv and peaks.bed. The bam files need all to be in one directory with a bam index file each.
The filtering options are dependent on the input file type. In case,
a fragment file is provided the parameter minFrags filters
cells with too less fragments. In case, a directory with bam files is
provided the parameter mapqFilter filters low quality
reads. For the option to provide directly a count matrix, no filtering
is currently implemented.
Note of caution We compared the results of CNV calling with the peak matrix compared to fragment file as input. We used the SNU601 dataset as in the publication and compared with the scWGS dataset as groundtruth. The correlation between the CNVs is worse when using the matrix as input (0.74 from 0.85). We loose reads that are not in peak regions so overall we get less coverage over the genome. The result is that we can miss some gains and losses that would otherwise have been identified with the fragment input. So we would suggest a bit of caution in the interpretation of the results, but the comparison with the groundtruth the method still gives valid results.
The example is run on the genome hg38, but any other genome available
at BSgenome can be used correspondingly, after installing the matching
bioconductor package. Corresponding to the genome version, the correct
blacklist file need to be provided to the function and the
corrected naming of the excluded chromosomes (variable
exclude).
The window size parameter defines the region size for the CNV predictions, counts within each bin will be aggregated and one CNV annotation for each bin is obtained.
The minsizeCNV parameter regulates the minimal number of
consecutive bins required for a CNV annotation, i.e. each CNV will
contain at least minsizeCNV bins. Setting the variable to 0
disables the restriction that a CNV needs to have a certain size and
also CNVs with one bin can be called.
The k parameters regulates the number of iterations of
the segmentation algorithm. A maximum of 2^k CNVs can be
found per chromsomes in k iterations. A large
k parameter increases however the runtime substantially, so
we set the default value to 4.
The epiAneufinder package contains different downstream functions for a more in depth analysis of the CNV predictions, in particular to split the cells into different subclones and different additional plotting functions.
Subclones in the dataset can be identified based on hierarchical
clustering of the identified CNVs. The dendogram is cut at a chosen
depth tree_depth.
#Load result table
res_table<-read.table("epiAneufinder_results/epiAneufinder_results/results_table.tsv")
#Need to reformat column names as dash are replaced by points in R
colnames(res_table)<-gsub("\\.","-",colnames(res_table))
subclones<-split_subclones(res_table,tree_depth=4,
plot_tree=TRUE,
plot_path="epiAneufinder_results/epiAneufinder_results/subclones.pdf",
plot_width=4,
plot_height=3)
#Data frame with the subclone split
head(subclones)
#> cell subclone
#> cell-TCAGCTCTCGCGTTCT-1 cell-TCAGCTCTCGCGTTCT-1 1
#> cell-TACGCAAAGTCGTGAG-1 cell-TACGCAAAGTCGTGAG-1 1
#> cell-TAACTTCTCCCGGGTA-1 cell-TAACTTCTCCCGGGTA-1 1
#> cell-ACCATCCGTTCTGAAC-1 cell-ACCATCCGTTCTGAAC-1 2
#> cell-CTCAGCTAGAGACTCG-1 cell-CTCAGCTAGAGACTCG-1 2
#> cell-ACAATCGTCTCATCCG-1 cell-ACAATCGTCTCATCCG-1 3Additional annotations can be added to the karyograms generated by
epiAneufinder, for example cell type annotations or the subclonal
distribution identified by split_subclones.
We recommend to save the karyogram as a png, defined via the variable
plot_path, as pdf versions can get quite large.
#Load result table
res_table<-read.table("epiAneufinder_results/epiAneufinder_results/results_table.tsv")
#Need to reformat column names as dash are replaced by points in R
colnames(res_table)<-gsub("\\.","-",colnames(res_table))
annot_dt<-data.frame(cell=subclones$cell,
annot=paste0("Clone",subclones$subclone))
plot_karyo_annotated(res_table=res_table,
plot_path="epiAneufinder_results/epiAneufinder_results/karyo_annotated.png",
annot_dt=annot_dt)The predictions of epiAneufinder on individual cells can be further
explored by evaluating the count profile of each cell. The set of cells
for detailed investigation can be either subsampled randomly by
assigning an integer value to the parameter selected_cells,
then this number of cells are sample. Or a vector of barcodes can be
given directly to selected_cells.
outdir<-"epiAneufinder_results/epiAneufinder_results/"
#Version 1 - Selected a specific set of barcodes, e.g. based on subclones
cell_barcodes<-subclones$cell[subclones$subclone==1]
cell_barcodes
#> [1] "cell-TCAGCTCTCGCGTTCT-1" "cell-TACGCAAAGTCGTGAG-1"
#> [3] "cell-TAACTTCTCCCGGGTA-1" "cell-TCGCAGGCATCCTCGT-1"
#> [5] "cell-CGGACCATCTCTGTTA-1" "cell-TCAATTCTCATACTTC-1"
#> [7] "cell-TCGGGACTCTAAGGTC-1" "cell-TACATTCTCCATCATT-1"
#> [9] "cell-TATGTTCCAGGACTGA-1" "cell-TGACTCCCAAGTCTGT-1"
plot_single_cell_profile(outdir,
threshold_blacklist_bins=0.85,
selected_cells=cell_barcodes,
plot_path=paste0(outdir,"individual_tracks_clone1.pdf"),
plot_width=25,
plot_height=15)
#> quartz_off_screen
#> 2
#Version 2 - Randomly subsampling a certain number of cells
plot_single_cell_profile(outdir,
threshold_blacklist_bins=0.85,
selected_cells=3,
plot_path=paste0(outdir,"individual_tracks_random_subsample.pdf"),
plot_width=25,
plot_height=15)
#> quartz_off_screen
#> 2